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Adaptive Multiple Importance Sampling for Gaussian 
Processes 


Xiaoyu Xiong • Vaclav Smidl • Maurizio 
Filippone 


Abstract In applications of Gaussian processes where quantification of uncertainty 
is a strict requirement, it is necessary to accurately characterize the posterior distri¬ 
bution over Gaussian process covariance parameters. Normally, this is done by means 
of standard Markov chain Monte Carlo (MCMG) algorithms. Motivated by the is¬ 
sues related to the complexity of calculating the marginal likelihood that can make 
MGMC algorithms inefficient, this paper develops an alternative inference frame¬ 
work based on Adaptive Multiple Importance Sampling (AMIS). This paper studies 
the application of AMIS in the case of a Gaussian likelihood, and proposes the 
Pseudo-Marginal AMIS for non-Gaussian likelihoods, where the marginal likelihood 
is unbiasedly estimated. The results suggest that the proposed framework outper¬ 
forms MCMC-based inference of covariance parameters in a wide range of scenarios 
and remains competitive for moderately large dimensional parameter spaces. 

Keywords Gaussian processes ■ Bayesian inference ■ Markov chain Monte Carlo • 
Importance sampling 


1 Introduction 


Gaussian Processes (GPs) have been proved to be a successful class of statistical in¬ 
ference methods for data analysis in several applied domains, such as pattern recog - 
nition ( Rasmus sen and Williamsll200(il: iBishorJIioOTl : iFilin none and Girol amill20l4 l. 
neuroimaging ([Filj)£oneetalj j2012l '). signal processing (|Kim et al.l 20141 1 . Bayesian 
optimization ( Jones et al.l 199a l. and emulation and calibration of computer codes 
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( Kennedy and O’Haganl[2r)0ll ~l. The features that make GPs appealing are the non- 
parametric formulation that yields the possibility to flexibly model data, and the 
parameterization that makes it possible to gain some understanding on the system 
under study. These properties hinge on the parameterization of the GP covariance 
function and on the way GP covariance parameters are optimized or inferred. 

It is established that optimizing covariance paramet ers can severely affect the _ 

ability of the model to quantify uncertainty in predic tions ( NeaJl9^ : lFilippone and GirolaiJ 
l2014l : lMlippone et al.ll201^ : lTavlor and Digglei201 j l. Therefore, in applications where 
this is undesirable, it is necessary to accurately characterize the posterior distribu¬ 
tion over covariance parameters and propagate this source of uncertainty forward to 
predictions. This task is particularly challenging when dealing with GPs. Inference of 
GP covariance parameters is generally analytically intractable, but a further compli¬ 
cation arises from the difficulties associated with the repeated computation of the so 
called marginal likelihood, which is necessary when employing any standard inference 
method. In particular, in the case of a Gaussian likelihood, the marginal likelihood is 
computable but extremely costly due to the cubic scaling with the number of input 
vectors. When the likelihood function is not Gaussian, e.g., in classification, in ordi¬ 
nal regression, or in Gox-processes, the marginal likelihood is not even computable 
analytically. 

In response to the challen ges above, a large body of the literature develops ap¬ 
proximate inference metho ds (|Wilhimis_andJ3arba ^9^ 200fl 


Kussand_^nsmussOT 

20081 : Hensman et al 


2005: [Rasmussen and Williams 200£; Nickisch and RasmussenI 


2013) j which, although successful in many cases, give no guar¬ 


antees on the effect on the quantification of uncertainty in practice. In the direction of 
quantify uncertainty without introducing any bias, in the literature there have been 
attempts to employ Markov chain Monte Carlo (MCMC) techniques; we can br oadly 
divide such attempts in works that propose reparameterization techniques (jNeall 
I 1999 I : [Murray and Adamsll201ol : [Vanhatalo and Vehtaril[2007l : [Filippone et al.l[201.lh . 
or method s that carry out inference based on unbiased computations of the marginal 


likelihood ( Filippone and Girolami20l3 : lFilipponel20l3 : lMurrav and Graham I2OI5I '). 
Although these approaches proved successful in a variety of scenarios, employing 
MCMG algorithms may lead to inefficiencies; for instance, optimal acceptance rates 
for popular MCMC algorithms s uch as the Metropolis-Hastings (MH) algorithm 
(around 25% ([Ro berts et ^ 1997)_)and the Hybrid Monte Carlo (HMC) algorithm 
(about 65% ( Beskos et al. 201,81 : N^[201l[ ~)') indicate that several expensive compu¬ 
tations are wasted. Furthermore, it is established that introducing adaptivity into 
MCMC proposal mechanis ms to improve efficiency may lead to convergence issues 
( Andrieu and Robertll200i] ~). 

In this paper we develop a general framework to learn GPs aimed at overcoming 
the aforementioned limitations of MCMC methods for GPs, where expectations un¬ 
der the posterior distribution over covariance parameters are carried out by means 
of th e Adaptive Multiple Importance Sampling (AMIS) algorithm ( Cornuet et al.l 
[ 2 OI 2 I '). The application of this framework to the Gaussian likelihood case, although 
novel, is relatively straightforward given that the likelihood is computable. In the 
case of non-Gaussian likelihoods, the impossibility to compute the likelihood ex¬ 
actly, motivates us to propose a novel version of AMIS where the likelihood is 
only (unbiasedly) estimated . Inspired by the Pseudo-Marginal MCMC approaches 
d Andrieu and Robertsl[2009l '). we therefore propose the Pseudo-Marginal AMIS (PM- 
AMIS) algorithm, and provide a theoretical analysis showing under which conditions 
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PM-AMIS yields expectations under covariance parameters without introducing any 
bias. The propos ed PM-AMIS is an instance of th e Importance Sampling squared 
(IS^) algorithms I Pitt et ^l2012l : iTran et al.l 120141 ') that are gaining popularity as 
practical Bayesian inference methods. 

Summarizing, the main contribution of this work are: (i) the application of AMIS 
to learn GPs with any likelihoods; (ii) a theoretical analysis of PM-AMIS; (iii) an 
extensive comparison of convergence speed with respect to computational complexity 
of AMIS versus MCMC methods. 

The results demonstrate the value of our proposal. In particular, the results 
indicate that AMIS is competitive with MCMC algorithms in terms of convergence 
speed over computational cost when calculating expectations under the posterior dis¬ 
tribution over covariance parameters. Furthermore, the results suggest that AMIS 
is a valid alternative to MCMC algorithms even in the case of moderately large di¬ 
mensional parameter spaces, which is common when employing richly para meterized 
covar iances (e.g., Automatic Relevance Determination (ARD) covariances ( MacKa'^ 
1 19941 ')'). Overall, the results suggest a promising direction to speedup inference over 
GP covariance parameters given that importance sampling-based inference methods, 
unlike MCMC algorithms, are inherently parallel. 

The paper is organised as follows. In section 2 we provide a brief review of GP 
regression and Bayesian inference. Section 3 presents the proposed Adaptive Multiple 
Importance Sampling for Gaussian Processes. Section 4 reports the experiments and 
the results. In section 5, we conclude the paper. 


2 Bayesian Gaussian Processes 

2.1 Gaussian Processes 

Let X be a set of n input vectors e IR'^(1 < i < n) and let y be the vector con¬ 
sisting of the corresponding observations yi. In most GP models, the assumption 
is that observations are conditionally independent given a set of n latent vari¬ 
ables. Such latent variables are modeled as realizations of a function /(x) at the 
input vectors xi,...,x„, i.e., f = {/(xi),...,/(x„)}. Latent variables are used to ex¬ 
press the likelihood function, that under the assumption of independence becomes 
P(y I f) = I /i)i ■with p{yi I fi) depending on the particular type of data 

being modeled (e.g., Gaussian for regression, Bernoulli for probit classification with 
probability P{yi = 1) = <l>(/(xi)) where <1> is defined as the cumulative normal dis¬ 
tribution). 

'What characterizes GP models is the way the latent variables are specified. In 
particular, in GP models the assumption is that the function /(x) is distributed 
as a GP, which implies that the latent function values f are jointly distributed as 
a Gaussian p(f | 6 ) ~ A/'(0,K), where K is the covariance matrix. The entries of 
the covariance matrix K are specified by a covariance (kernel) function with hyper¬ 
parameters 6 between pair of input vectors. In this work, two kinds of covariance 
function are considered. The first is the RBF (Radial Basis Function) kernel defined 
as: 


fc(xi,Xj) = crexp 


( 1 ) 
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The parameter t defines the length-scale of the interaction between the inpnt vectors, 
a represents the marginal variance for each latent variable. The second is the ARD 
kernel, which takes the form: 


fc(Xi,Xj) = aexp i 


( 2 ) 


The advantage of the ARD kernel is that it acconnts for the infinence of each feature 
on the prediction of y, with larger value s of parameters ( ti , ...^Td) indicating a higher 
influence of the corresponding features ( Kim et ^l2014l b For simplicity of notation, 
in the remainder of the paper we will denote by 6 the vector of all kernel parameters. 

When making predictions, using a point estimate of 9 has been reported to poten¬ 
tially underestimate the uncertainty in p redictions or yield inaccurate assessment of 
the r e lative influen ce of different features ( Filippone et al.l2012l : lFilir)r)one and Girolamil 
120141 : [B'ishopll20(y^ 'l. Therefore, a Bayesian approach is usually adopted to overcome 
these limitations, which entails characterizing the posterior distribution over covari¬ 
ance parameters. In order to do so, it is necessary to employ methods, such as 
MCMC, that require computing the marginal likelihood every time 6 is updated. 
We now discuss the challenges associated with the computation of the marginal like¬ 
lihood for the special case of a Gaussian likelihood and the more general case of a 
non-Gaussian likelihood. 


2.1.1 Gaussian likelihood 


In the GP regression setting, the observations y are modeled to be Gaussian dis¬ 
tributed with mean of f (latent variables) and covariance AI, where I denotes the 
identity matrix and A is the variance of noise on the observations. In this setting, 
the likelihood p{y \ f) and the GP priors p{i \ 6) form a conjugate pair, so latent 
variables can be integrated out of the model leading to p[y \ X,0) ~ AA(0, C), where 
C = K -|- AI. This gives a direct access to the log-marginal likelihood 

log[p(y|0)] = -ilog|C| - iy^C^VTconst. 


Although computable, the log-marginal likelihood requires computing the log deter¬ 
minant of C and solve a linear system involving C. These calculations are usually 
carried out by factorizing the matrix C using the Cholesky decomposition, giving 
C = LL^, with L lower triangular. The Cholesky algorithm requires 0{n^) opera¬ 
tions, but after th at computing the terms of the m arginal likelihood requires at most 
0(pi?) operations ( Rasmussen and WilliamsiboOfil '). 


2.1.2 Non-Gaussian likelihood 

In the case of non-Gaussian likelihoods, the likelihood p{y \ f) and the GP prior 
p(f I 6) are no longer conjugate. As a consequence, it is not possible to solve the 
integral needed to integrate out latent variables 


p{y\0) = / p{_y\i)p{f\9)df 
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and this requires some form of approximation. A notable example is GP probit 
classification, which is what we explore in detail in this paper, where the observations 
y are assumed to be Bernoulli distributed with success probability given by: 

P{yi\fi)=’^iyifi) (3) 

For GPs with non-Gaussian likelihood, there have been several proposals on how to 
carry out approximation to integrate out the latent variables, or to avoid approxi¬ 
mations altogether. The focus of this paper is on methods that do not introduce any 
bias in the calculation of expectation under the posterior over covariance parameters, 
so we will discuss these approaches in detail in the next sections. 


2.2 Bayesian inference of covariance parameters 


For simplicity of notation, we will denote the posterior distribution over covariance 
parameters by 


Tv(0) :=p(0|y,X) 


p(y I 6)p{0) 

fp(y 1 0 )p{ 0 )de 


(4) 


where p{6) encodes any prior knowledge on the parameters 6. Within the Bayesian 
framework, we are usually interested in calculating expectations of functions of 6 with 
respect to the posterior distribution, i.e., For instance, setting h{6) = 

p{yi, I 0,x*,y,X), we obtain the predictive distribution for the label i/* associated 
with a new input vector x*. 

The denominator needed to normalize the posterior distribution tt{0) is in¬ 
tractable, so it is not possible to characterize the posterior distribution analytically. 
Despite this complication, it is possible to resort to a Monte Carlo approximation to 
compute expectations under the posterior distribution of 0: 


2 = 1 


(5) 


where 0^*^ denotes the ith of N samples from tv{0). However, as it is usually not 
feasible to draw samples from tt{0) directly, usually MCMC methods are employed 
to generate samples from the posterior tt{0). 

An alternative way to compute expectations, is by means of importance sampling, 
which takes the following form: 

E^^e)[He)] = jHe)^qm0 ( 6 ) 

where q(0) is the importance distribution. The corresponding Monte Carlo approx¬ 
imation is of the form: 


Ki0)[hm 


1 

N 


2 = 1 


7r(0W) 


(7) 


where now the samples are drawn from the importance sampling distribution 
q(0). The key to make this Monte Carlo estimator accurate is to choose q{0) to be 
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similar to the function that needs to be integrated. It is easy to verify that when 
q(0) is proportional to the function that needs to be integrated, the variance of 
the importance sampling estimator is zero. Therefore, the success of importance 
sampling relies on constructing a tractable importance distribution q{6) that well 
approximates h{0)'K{6). In the remainder of this paper we will discuss and employ 
methods that adaptively construct q{0). 

Both Monte Carlo approximations in equations 0 and Q converge to the de¬ 
sired expectation, and in practice, they can estimate the desired integral to a given 
level of precision ( Gelman and Rubinlll992al : iFlegal et al.l[2f)07^ . The experimental 
part of this work is devoted to the study of the convergence properties of the expec¬ 
tation 15,7(0) [^(^)] with respect to the computational effort needed to carry out the 
Monte Carfo approximations in Equations © and Q. 


2.3 Pseudo-Marginal MCMC for inference of covariance parameters 


Difficulties in computing the expectation in equation m arise when the likelihood 
is not Gaussian, thus leading to the the impossibility to calculate the marginal like¬ 
lihood exactly, which is necessary to employ standard MCMC algorithms to draw 
from the posterior n{0). In cases where the marginal likelihood can be unbiasedly 
estimated, it is possible to resort to so called Pseudo-Marginal MCMC approaches. 
Taking the Metropolis-Eastings algorithm as an example, it is possible to replace 
the exact calculation of the Hastings ratio 

p{y I G')p{0') 
p{y I O)p{0) 

by an approximation where the marginal likelihood is only unbiasedly estimated 

p{y I o')p{o') 

p{y I o)p{0) 

where p(y | 0) denotes such an approximation. Interestingly, the introduction of this 
approximation does not affect the properties of the MCMC approach that stiff yields 
samples from the correct posterior 7r(0). The effect of introducing this approximation 
is that the efficiency of the corresponding MCMC approach is reduced; this is due to 
the possibility that the algorithm accepts a proposal with an largely overestimated 
value of the marginal likelihood, making it difficult for any new proposals to be 
accepted. 

By inspecting the GP marginal likelihood 

p{y I I 


we observe that we can attempt to unbiasedly estimate this integral using importance 
sampling: 


p{y I 0) 


1 I fzMf. I 

Nimp ^ q{ii I y,0) 


( 9 ) 


Here Nimp is the number of samples fi drawn from the importance density q(i \ 
y,0). The motivation for attempting this approximation is that in the literature 
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there have been many efforts to construct accurate approximations to the posterior 
distribution over the latent variables p{f\y,6) ocp(y | fi)p{fi \ 0). This suggests that 
we can hope that the resulting estimator for the marginal likelihood is accurate 
enough to introduce an acceptable amount of noise in the calculation of the Hastings 
ratio to make the resulting MCMC approach reasonably efficient. In this paper, 
we investigate approximations q{f \ y,0) to the posterior obtained by the Laplace 
Appr oximation and Expectation Propagation algorithms ( Rasmussen and William j 

l2006lb 


3 Adaptive Multiple Importance Sampling for Gaussian Processes 


Inefficiencies arising from the use of MCMC methods to sample from the posterior 
distribution over covariance parameters are due to fact that several proposals are 
rejected. In order to mitigate this issue, some adaptation mechanisms of the proposals 
can be used based on previous MCMC samples, but the chain resulting from the 
adaptivity is no longer Markovian. As a result, elaborate erg odicity results are needed 
to establish convergence t o the true posterior distribution ( Haario et al ] ll99ill : l200ll : 
lAndrieu and Robertll200lh. 


In response to this, Cappe et al.l ( 20041 ') proposed a universal adaptive sampling 
scheme called populat ion Monte Carlo (P MC), where the difference from Sequential 
Monte Carlo (SMC) (|Doucet et al.l [ 2 OO 1 I 1 is that the target distribution becomes 
static. This method is reported to have better adaptivity than MCMC due to the 
fact that the use of importance sampling makes e rgodicity no t an issue. At each 
iteration of PMC, sampling importance resampling ( R,ubinl[l988l ~) is used to generate 
samples that are assumed to be marginally distributed from the target distribution 
and hence the approach is unbiased and can be stopped at any time. Moreover, the 
importance distribution can be adap ted using pa r t (gene rated at each iteration) or all 
of the importance sample sequence. IDouc et ] (I2n07all0l also introduced updating 
mechanisms for the weights of the mixture in the so called D-kernel PMC, which 
leads to a reduction either in Kullback divergence between the mixture and the 
target distribution or in the asymptotic variance for a f unction of intere s t. An earlier 
ad aptive importance sam pling strategy is illustrated in lOh and BergeJ (Il992h . 

ICornuet et al.l ( 2012l f proposed a new perspective of adaptive importance sam¬ 
pling (AIS), called adaptive multiple importance sampling (AMIS), where the dif¬ 
ference with the aforementioned PMC methods is that the importance weights of all 
simulations, produced previously as well as currently, are re-evaluated at each itera- 
tion. This method follo ws the “deterministic multiple mixture” sampling scheme of 
lOwen and Zhoul ( 2000l ') . The corresponding importance weight takes the form 






T-1 

J2Ntqt{0h^t) 

t=0 


( 10 ) 


where T is the total number of iterations, /(.) denotes the target distribution 7 r(.) 
up to a constant, i.e., 7r(.) oc /(.), qt{.) denotes the importance density at iteration 
t with sequentially updated parameters 7 ), 0 * are samples drawn from qt{.) with 
0<f<T-l, l<i<Nt. 

The fixed denominator in CQl) gives the name “deterministic multiple mixture”. 
The motivation is that this construction can achieve an upper bou nd on the asymp- 
totic variance of the estimator without rejecting any simulations ( Owen and Zhoul 
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Algorithm 1 Generic AMIS as analyzed bv lCornuet et all ll2012f) 


— At iteration t = 0, 

1. Generate Nq independent samples 0^{1 <i < Nq) from the initial importance density 
Qo[8]^) 

2. For 1 < i < No, compute 5? = Noqo{0°-, ^),w9 = /(0°) jqo{0°\^o) 

3. Estimate 7 ^^ of gi(0;7i) using the weighted samples and a 

well-chosen efficiency criterion for estimation. 

— At iteration t = l, T— 1, 

1. Generate Nt independent samples 0^(1 < i < Nt) from 

2. For 1 < i < A’t, compute the multiple mixture at 6^ 

t 

Sj = Noqo{9ino) + '^^im{9i-nt) 

1 = 1 


and derive the importance weights of 6^ 

t 

j^Q 

3. For 1 < / < t — 1 and 1 < i < A’z, update the past importance weights as 

t 

&\^&\ + Ntqt{9[-,^^) and tn'^/(»')/[5*/^iV,-] 

J=0 


4. Estimate using all the weighted samples 

and the same efficiency criterion for estimation. 


bOOnl '). In AMIS, the parameters 7 of a parametric importance function gt(0;7) 
are sequentially updated using the entire sequence of weighted importance samples, 
based on efficiency criteria such as moment matching, minimum Kullback diver- 
gence with respect t o the target or minimum variance of the weights (see, e.g., 
lOrtiz and Kaelblind (l200d f for stochastic gradient-based optimization of these effi¬ 
ciency criteria). This leads to a sequence of importance distributions, qi{0\^), 
Algorithm 1 gives the pseudo code of the generic AMIS algorithm. 

In this paper, we use a Gaussian importance density with mean fi and covariance 
S, i.e., 7 ( = (/x‘,X'*). We also choose moment matching as the efficiency criterion 
to estimate y = [ft* ) using the self-normalized AMIS estimator: 




and 






Despite the efficiency brought by AMIS compared with other AIS techniques, 
proving convergence of this algorithm is not straightforward. The work in iMarin et al.l 
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(|2ni4h proposed a modified version of AMIS called MAMIS, aiming at obtaining a 
variant of AMIS where convergence can be more easily established. The difference 
is that the new parameters are estimated based only on samples produced at 
iteration t, i.e., with classical weights f {6\)/. Then the weights 

of all simulations are updated according to (IIOII to give the final output. The sample 
size Nt is suggested to grow at each iteration so as to improve the accuracy of 
MAMIS effectively solves any convergence issues of AMIS, but convergence is slower 
due to the fact that less samples are used to update the importance distribution. 


3.1 Pseudo-Marginal AMIS 

The above AMIS/MAMIS are designed for the general analytically computable 
marginal likelihood such as in the case of GP regression. In this paper, we pro¬ 
posed using AMIS to sample from the posterior where the likelihood is analytically 
intractable but can be unbiasedly estimated. In this case, we can attempt employing 
AMIS replacing the exact calculation of the marginal likelihood by an unbiased esti¬ 
mate, obtaining an unbiased estimate of the posterior up to a normalizing constant: 

ho)=P{y\0)p{9) ( 11 ) 

We refer to this as Pseudo-Marginal AMIS (PM-AMIS), inspired by the name Pseudo- 
Marginal MCMC that was given to the class of MC MC algorithms that replace exact 
calculations of the likelihood by unbiased estimates (lAndrieu and Robertsi2009l '). The 
pseudo-code of PM-AMIS is similar to that of AMIS described in Algorithm 1 except 
that the target distribution up to a constant f{6) = p(y | 0)p{6) is replaced by the 
above unbiased estimate f{0). 

It is known that despite the fact that calculations are approximate, Pseudo- 
Marginal MCMC methods yields samples from the correct posterior distribution 
over covariance parameters, so a natural question is whether the same arguments 
hold for our proposal. In the remainder of this section, we provide an analysis of the 
properties of Pseudo-Marginal AMIS, discussing the conditions under which it yields 
expectations with respec t to the pos t erior distribution over covariance parameters 
that are unbiased. As in iTran et ^ (l2014l l IPitt et all (120121 ') , we introduce a ran¬ 
dom variable z whose distribution (denoted by p{z \ 6) herein) is determined by the 
randomness occurring when carrying out the unbiased estimation of the likelihood 
p{y I 0). Dehne: 

z = logp(y|0)-logp(y|0) (12) 

i.e. 

p{y \ G) = p{y \ 9)e^ (13) 

Due to the unbiased property {E\p{y \ 9)] = p{y \ 0)), we readily verify that 
E[e^\ = 1. For the sake of clarity, it is useful to define the unnormalized joint density 
of z and 9 as: 

f{z, 9) = p{y I 0)e^p(z | 9)p{9) (14) 

with a corresponding normalized version 


fiz,9) 


Tr(z,9) 


Z 


(15) 
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Marginalizing this joint density with respect to 2 


7 r(^, 6)dz 


Z Z Z 


(16) 


yields the target posterior 7r(0) of interest defined in Equation Q. 

Recall that our objective is analyzing expectations under the posterior over the 
parameters 7r(0) of some function h{0) 

h(e)Tr{e)de = J h{e):!^de (i7) 

We begin our analysis by substituting Equation m into Equation (HU), obtaining 

E^^e)[He)] = Jh{e)l^dedz (is) 

In PM-AMIS, let Nt denote the number of samples generated at each iteration 
t, qt{0) denote the importance density at each iteration for 7r(0). We also dehne 



qt{z,e)=p{z\e)qt{e) ( 19 ) 

as the joint importance density at each iteration for 7r(z, 0), {z\, 6\) as samples drawn 
from qt{z,6) with 0<t<T,l<i<Nt. 

Since in a practical setting f{z, 0) is the only function that we can evaluate, the 
expectation defined in Equation (fTSl) is estimated by the self-normalized PM-AMIS 
estimator: 

, T Nt 

(20) 

Z^t=0 Z^2=l t=0 i=l 

where the weights of this estimator are computed as 


Wi* = 


mM) 


EU 


AT, 




( 21 ) 


Expanding the terms in the computations of the weights, namely substituting 
Equation (El) (0 into Equation dSD, we have 


^ t ^ p(y I e\)e^lp{z\ I 0*)p(0*) 

' , ( 22 ) 

^ p(y I ^De^‘P(^*) 

ET=o (^i) 

2^j=o 

which can be rewritten in terms of the unbiased estimate of the marginal likelihood 
as 


p(y I 

l^j=0 


f{e\) 

El=0 

2-^j=o •? 


( 23 ) 
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Equation (ESI) shows how the importance weights can be computed by the unbiased 

estimator of the marginal likelihood^ _ 

We now appeal to Lemma 1 in ICornuet et alJ ( 20121 1 . which gives the conditions 
under which the self-normalized estimat or of AMIS wil l conv erge to Equation (fTTl) . 
Following the conditions in Lemma 1 in ICornuet et alJ (l2012l l. when T and Nq, 
Nt-i are fixed, and when Nt goes to infinity, Wi* (Equation (I21II 1 becomes: 




QTizlel) 


(24) 


Then we have 




T Nt 


^Nt 




1 


t=0 
T 


/ HO) 


f{z, 6 ) 

qT{z, 0 ) 


qriz, 6 )d 6 dz 




f{z,0) 


J2t = 0^t 
Lt=o^tt=o ^ 

= J h{e)l^de = E^^0)[hm 


dOdz 


where the normalizing constant Z is estimated by 


A^t=o 2^1=1 


Therefore, under the conditions that T and Nq, ..., Nt~i are hxe d and that Nt 


goes t o inhnity, which are the same conditions mentioned in Lemma 1 in ICornuet et al] 
( 2012h . the e stimator of Equation (I2UI) proves to be an unbiased estimator of E^( 0 ^ [h{0)\. 
As noted in ICornuet et al ] (l2012ll . we remark that these conditions might prove re¬ 
strictive in practice; however, these conditions provide some solid grounds onto which 
convergence can be established for AMIS. Furthermore, we note that in a practical 
setting, when in doubt as to whether convergence might be an issue, it is always 
possible to switch to the modified version of AMIS ( Marin et al ][2li) during exe¬ 
cution. 


4 Experiments 

4.1 Competing sampling methods 


In this section, we present the state-of-the-art MCMC and AIS sampling meth¬ 
ods considered in this work. The aim is to find out whether adaptive imporat- 
nce sampling (AMIS/MAMIS) can improve speed of convergenc e with respect to 
computational comp lexity co mpared to MCMC approaches (MH (|Metropolis et all 
I 1953 I: lHastingslll97Clll. HMC dPuane et al.lll987l : lNeallll99^. NU TS and NUTSDA 
( Hoffman and CelmanI 1201^ 1. SS (Slice Sampling) ( Neall l2003bl ll. The competing 
sampling algorithms considered in this work are given in Table [1] 
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Sampler 

Tuning parameters 

Metropolis-Hastings (MH) 

Covariance matrix S 

Hybrid Monte Carlo (HMC) 

Mass matrix U, Leapfrog stepsize e, Number 
of leapfrog steps L 

No-U-Turn Sampler (NUTS) 

Mass matrix U, Leapfrog stepsize e 

NUTS with Dual Averaging (NUTSDA) 

Mass matrix U 

Slice Sampling (SS) 

Width of the initial bracket 


Table 1: Competing sampling algorithms considered in this work. 


4.2 Datasets 

The sampling methods consi dered in this work are tested on six UCI datasets 
( Asuncion and NewmanI [20071 !. The Concrete, Housing and Parkinsons datasets are 
used for GP regression, whereas the Glass, Thyroid and Breast datasets are used 
for GP classihcation. The number of data points and the dimension of the features 
for each data point are given in Table [21 For the original Parkinsons dataset we 
randomly sampled 4 records for each of the 42 patients, resulting in 168 data points 
in total. 



Datasets for regression 

Datasets for classification 

Concete 

Housing 

Parkinsons 

Glass 

Thyroid 

Breast 

n 

1030 

506 

168 

214 

215 

682 

d 

8 

13 

20 

9 

5 

9 


Table 2: Datasets used in this paper; n denotes the number of data points, d denotes 
the dimension of the features. 


4.3 Experimental setup 
4 . 3.1 Settings for GP regression 

We compare three different covariances for the proposals of the MH algorithm. The 
first is proportional to the identity matrix. The second and third covariances are 
proportional to the inverse of the negative Hessian (denoted by H) evaluated at the 
mode (denoted by m); one uses the full matrix, whereas the other uses its diago¬ 
nal only, namely diag((— H)”^). The mode m is found by the maximum likelihood 
optimization using the “BFGS” method. 

Thus the proposals that we compare in this work take the form of Af(0 | m,al), 
Af(0 I m,a(— H)~^), and N'{9 \ m,a diag((—H)~^)), where a is a tuning parameter. 
We tune a i n pilot runs un t il we get the desired acceptance rate (around 25%), as 
suggested by [Roberts et al.l (ligOTi l. 

The approximate distribution J\f{0 \ m, (— H)^^) is used to be the initial impor¬ 
tance density for AMIS/MAMIS. This approximation is also used to initialize sev¬ 
eral independent sequences of samples from other samplers considered in this work 
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(listed in Table [T|). In this way, valid summary inferen ce from multiple independent 
seqnences can be obtained ( Gelman and R,ubinlll992al ~l. For AMIS/MAMIS, we ex¬ 
plored two kinds of update of the covariance of the importance density. One updates 
the full covariance, whereas the other updates only the diagonal of the covariance. 
The first two rows of Tabled] show the experimental settings for AMIS/MAMIS. 

Motivated by the fact that using knowledge of sc ales and co rrelation of the posi¬ 
tion variables can improve the performance of HMC (lNealll20lll l. we also chose three 
kinds of mass matrix for HMC, namely the identity matrix, the inverse of the ap¬ 
proximate covariance, and the inverse of the diagonal of the approximate covariance. 
We set the maximum leapfrog steps to be 10. We then tune the stepsize e until a sug¬ 
gested acceptance rate (around 65%) is reached ( Beskos et al.ll2013 : lNea l l201lh . The 
three forms of mass matrix apply to NUTS, NUTSDA as well; a full description of 
th e pseudo codes of these algori thms can be found in Algorithm 3 and 6 respectively 
in iHoffman and GelmarJ (l2012l 'l. NUTS requires the tuning of a stepsize e. After a 
few trials, we set the stepsize of NUTS to 0.1. Although tuning leapfrog steps and 
stepsize is not an issue in NUTSDA, the parameters ( 7 , 10 ,^) for the dual averaging 
scheme therein have to be tuned by hand to produce reasonable results. By trying 
a few settings for each parameter, hnally the values 7 = 0.05, fp = 30,/t = 0.75 were 
used in both the RBF and ARD kernel case. 

The slice sampling algorithm adopted in this paper makes component-wise up¬ 
dates of the parameters, where a new sampl e is drawn acc ording to the “stepping out” 
and “shrinkage” procedures as described in iNeall ( 2003bl l . In our implementation, we 
set the estimate of the typical size of a slice w to 1.5. 



RBF kernel 

ARD kernel 

T 

Nt 

T 

Nt 

AMIS 

1120 

25 

280 

100 

MAMIS 

46 

26t 

5 

3000 + 1000* 

PM-AMIS 

60 

400 

60 

400 


Table 3: Experimental settings for AMIS/MAMIS/PM-AMIS. T is the total number 
of iterations, Nt is the sample size at each iteration t. 


4 . 3.2 Settings for GP classification 


We compared only PM-AMIS and Pseudo-Marginal MH (PM-MH) in the case of 
GP classihcation. Since the likelihood is analytically intractable and thus is unbias- 
edly estimated, the critical property of reversibility and preservation of volume of 
HMC, NUTS, NUTSDA is no longer satisfied. Therefore, these algorithms are not 
considered in the GP classification case. It can be proved that slice sampling with 
the noisy estimate f{6) is still valid, but the adapta tion o f the sli ce estimate w such 
as the ’’stepping out” and ’’doubling” procedure in iNeall ( 2003lJ ) is not proper for 
this use case and naively running standa rd SS with the noisy estimate f(0) worked 
very poorly ( Murray and Graham 112013 ’). Consequently, SS is also not compared in 
this case. 

Both the EP and LA approximations are used as the importance densities to 
estimate the marginal likelihood. The last row of Table [3] shows the settings of PM- 
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AMIS in both the RBF and ARD cases except for the Breast dataset in the ARD 
case nsing LA approximation, where the total nnmber of iterations T is set to 240 
for the sake of presentation. The initial importance density is obtained by the same 
optimization method as described in Section fd. 3. II except that the gradient reqnired 
to perform the optimization cannot be computed analytically but is estimated from 
the EP or LA approximations. We update the full covariance of the importance den¬ 
sity during the adaptation process. The proposal of PM-MH also takes the form of 
A/'(0 I m,a(— H)~^) where H is the Hessian matrix obtained again from the approxi¬ 
mate marginal likelihood obtained by the EP or LA algorithms. We tuned a until we 
reached the recommended acceptance rate around 25% and use this tuned proposal 
to generate new samples. 


4.4 Convergence analysis 


Since the classic R score is for MCMC convergence analysis and not suitable for 
importance sampling, convergence analysis here is performed based on the IQR (in¬ 
terquartile range) of the expectation of norm of parameters x) [||^||]) over 

several repetitions against the number of 0{n^) operations; This is reported to be a 
more reliable measure of complexity than running time, as many other factors such 
as implementation details that do not relate directly to the actual comput ing com¬ 
plexity of the algorithms can affect the running time ( Filippone et al.ll2013l b For GP 
regression the IQR is computed over 100 replicates, whereas for GP classification it 
is based on 20 repetitions. 

For AMIS/MAMIS/SS/MH, the computing complexity lies in the computation of 
the function of f{0), where one 0(n^) operation is required to perform the Gholesky 
decomposition of the covariance matrix C. Whereas for HMG/NUTS/NUTSDA 
where computing the gradient is necessary, two extra O(n^) operations are needed 
for the computation of the inverse of the covariance matrix C. 

For PM-AMIS/PM-MH, the computing complexity largely comes from the EP 
or LA approximation of the posterior of the latent variables in order to compute the 
unbiased estimate f{9). Both EP and LA approximations take two 0{n^) operations 
to perform the Gholesky decomposition. One is for the Gholesky decomposition of the 
covariance matrix K of the GP prior, the other is for the Gholesky decomposition of 
the covariance of the approximating Gaussian. Each iteration of EP and LA requires 
three O(n^) operations and one O(n^) operation respectively. In LA approximation, 
two extra O(n^) operations are needed to compute the covariance of the Gaussian 
approximation. 


4.5 Results 

4-5.1 Notation of samplers 

Table |4] shows the notation for the different samplers considered in this paper. 

4-5.2 Convergence of samplers for GP regression 

In this section, we present the comparison of convergence of samplers for GP regres¬ 
sion considered in this paper. 
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AMIS/MAMIS 

AMIS/MAMIS for GP regression where the full covariance matrix of 
the proposal distribution is updated at each iteration 

AMIS-D/MAMIS-D 

AMIS/MAMIS for GP regression where only the diagonal of the 
covariance matrix of the proposal distribution is updated at each 
iteration 

MH-I 

MH for GP regression where the covariance of the starting proposal 
distribution for tuning is the identity matrix 

MH-D 

MH for GP regression where the covariance of the starting proposal 
distribution for tuning is the diagonal of the approximate covariance 
from the optimization 

MH-H 

MH for GP regression where the covariance of the starting proposal 
distribution for tuning is the approximate covariance from the 
optimization 

HMC-I/NUTS-I/NUTSD A-I 

HMG family for GP regression where the mass matrix is the identity 

matrix 

HMC-D/NUTS-D/NUTSDA-D 

HMC family for GP regression where the mass matrix is the inverse 
of the diagonal of the approximate covariance from the optimization 

HMC-H/NUTS-H/NUTSDA-H 

HMC family for GP regression where the mass matrix is the inverse 
of the approximate covariance from the optimization 

PM-AMIS 

AMIS for GP classification where the full covariance matrix of the 
proposal distribution is updated at each iteration 

PM-MH 

MH for GP classification where the covariance of the starting 
proposal distribution for tuning is the approximate covariance from 
the optimization 


Table 4: Notation for the samplers. 
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Fig. 1: Convergence of AMIS, Best of MAMIS, Best of MH family, Best of HMC 
family, SS for GP regression. 


Details of convergence results of AMIS family (AMIS/MAMIS), MH family (MH- 
I/MH-D/MH-H) and HMC family (standard HMC , NUTS, NUTSDA) can be found 
in and [B] Fig. [1] shows the final result of AMIS, best of MAMIS, best of MH 
family, best of HMC family and SS for the three datasets in both the RBF and 
ARD kernel case. It is interesting to see that AMIS/MAIMS performs best among 
all methods in terms of convergence speed in the RBF kernel case. In the ARD 
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kernel case, AMIS also converges much faster than the other approaches. However, 
our experiments show that in the ARD kernel case, although MAMIS converges 
faster than the other approaches in the Concrete dataset, it converges slowly in the 
Housing and Parkinsons datasets, which is probably due to the higher dimensionality 
compared to the previous cases. 

In cases where MAMIS converges slowly, we can exploit the fact that AMIS 
converges faster than MAMIS by running AMIS for a fixed number of iterations and 
then switch to MAMIS. In this way, we can ensure fast convergence of the adaptive 
scheme while ensuring that the scheme converges without issues. In the experiments, 
we tested this AMIS-MAMIS combination in cases where MAMIS converges slowly. 
We treated samples from AMIS as tuning cost for MAMIS to get an accurate initial 
importance density as is shown in bottom-right of Fig. [7] and Fig. [S]with EOT (end 
of tuning) indicated by the vertical dotted line. Three settings (Table O of AMIS- 
MAMIS were tested for the Parkinsons dataset. 



Nt for 
MAMIS 

★ number of 
tuning samples 
for MAMIS 

the corre¬ 
sponding 
tuning cost 

AMIS-MAMIS 

lOOOt 

13000 

4333 

AMIS-MAMIS’ 

SOOOi 

13000 

4333 

AMIS-MAMIS” 

sooot 

26000 

8667 


Table 5: Settings for AMIS-MAMIS. Nt is the sample size at each iteration t. * refers 
to the number of samples generated from AMIS for tuning the initial importance 
density of MAMIS. Unit of the tuning cost: number of operations. 


For the Housing dataset, we tested only AMIS-MAMIS in Tabled The results for 
the Housing and Parkinsons datasets in the ARD kernel case prove the convergence of 
AMIS-MAMIS. In particular, AMIS-MAMIS and AMIS-MAMIS” seem to compete 
well with the other MCMC approaches in terms of convergence for the Housing 
dataset and the Parkinsons dataset respectively. As is shown in the bottom-right of 
Fig. El the best performance of AMIS-MAMIS” for the Parkinsons dataset suggests 
that for higher dimensional problem, a more accurate initialization and a larger 
sample size at each iteration for MAMIS are necessary to achieve faster convergence. 


Another attempt that we made in this paper to improve convergence speed of 
the adaptive importance sampling schemes is to regulari ze the estimation of the pa - 
rameters of the importance distribution as illustrated in lSmidl and HofmanI ( 20l4 l. 
The regularization stems from the use of an informative prior on 7 of the impor¬ 
tance distribution q ti'j) of MAMI S and treat the update of these parameters in 
a Bayesian fashion ( Kulhav^llflflfil 'l. This construction makes it possible to avoid 
situations where the importance distribution degenerates to low rank due to few im¬ 
portance weights dominating all the others. In this work, we use an informative prior 
based on a Gaussian approximation to the posterior over covariance parameters. We 
denote this method by MAMIS-P and in the ARD kernel case it was tested only in 
the Housing dataset. The result indicates that even though MAMIS-P improves on 
MAMIS, its convergence is slower than AMIS-MAMIS (bottom-right of Fig. (TJ. 
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4-.5.3 Convergence of samplers for GP classification 

The comparison of convergence of samplers for GP classification (PM-AMIS and 
PM-MH) is presented in this section. 
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njmber of operations 


Thvroid - 

“ I 

1 

- iA'k.''' / f'V 

- PM-AMIS (LA) 

- - PM-MH (LA) 

■ • PM-AMIS (EP) 

• - PM-MH (EP) 

V \ • 

0 50000 150000 250000 

njmber of n^ operations 

Thyroid - ARD 

f 

1, 

- 

- PM-AMIS (LA) 

-PM-MH (LA) 

• • PM-AMIS (EP) 

• - PM-MH (EP) 


0 50000 150000 250000 



- PM-AMIS(LA) 

- - PM-MH (LA) 



Oe+00 1e+05 26<-05 3e+05 


njmber of n^ operations 


njmber of n^ operations 


Fig. 2: Convergence of Best of PM-AMIS, Best of PM-MH using EP optimization 
for GP classification. LA in the brackets indicates the case where the Gaussian 
approximation to the posterior of the latent variables used in the corresponding 
method is obtained by LA approximation, whereas EP in the brackets indicates the 
case where the Gaussian approximation is obtained by EP approximation. 


Fig. [2] shows the results of PM-AMIS and PM-MH using EP and LA approxi¬ 
mation with Nimp = 64, where Nimp denotes the number of importance samples of 
latent variables f to estimate the marginal likelihood p(y | 0). The results indicate 
that PM-AMIS is competitive with PM-MH in terms of convergence speed in all the 
EP approximation cases and in most of the LA approximation cases. The results 
also seem to suggest that PM-AMIS/PM-MH converge faster with EP approxima¬ 
tion than with LA approximation in most cases, which is probably because EP yield s 
a more accurate approximation than LA as reported in iKuss and Rasmussenl ( 2005ll . 
iNickisch and Rasmussenl ( 2008ll . We also tested the performance of PM-AMIS and 


PM-MH with Nimp = 1, the results of which are shown in As is seen from the 
hgures, both PM-AMIS and PM-MH algorithms with higher number of importance 
samples converge much faster than those with lower number of importance samples 
as expected. 


5 Conclusions 

In this paper we proposed the use of adaptive importance sampling techniques to 
carry out expectations under the posterior distribution of covariance parameters 
in Gaussian processes. The motivation for our proposal is based on a number of 
observations related to the complexity of dealing with the calculation of the marginal 
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likelihood. In GPs with a Gaussian likelihood, calculating the marginal likelihood and 
its gradient with respect to covariance parameters is expensive and standard MGMC 
algorithms reject proposals leading to a waste of computations. In GPs with a non- 
Gaussian likelihood, pseudo marginal MCMG approaches bypass the need to compute 
the marginal likelihood but may suffer from inefficiencies due to the fact that when a 
proposal is accepted and the estimated marginal likelihood is largely overestimated, 
it becomes difficult for the chain to accept any other proposal. A further motivation 
behind our proposal is that importance sampling-based algorithms are generally easy 
to implement and tune, and can be massively parallelized. 

The extensive set of results reported in this paper supports our intuition that 
importance sampling-based inference of covariance parameters is competitive with 
MGMC algorithms. In particular, the results indicate that it is possible to achieve 
convergence of expectations under the posterior distribution of covariance parameters 
faster than employing MGMC methods in a wide range of scenarios. Even in the case 
of around twenty parameters, where importance sampling based methods start to 
degrade in performance, our proposal is still competitive with MGMC approaches. 
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Appendices 


Appendices and m show the convergence results of the samplers for GP re¬ 
gression with the RBF covariance (RBF kernel case) and ARD covariance (ARD 
kernel case) respectively. The top-left of hgures in and |B] demonstrate the result 
of AMIS/MAMIS. It can be seen that AMIS/M AMIS that exploits the full covari¬ 
ance structure of the proposal distribution performs better than the one that only 
updates the diagonal of the covariance matrix of the proposal density. For the MH 
family (MH-I/MH-D/MH-H) and HMC family (standard BMC , NUTS, NUTSDA), 
figures in and m show that, the methods that make use of the scales and corre¬ 
lation of the parameters, perform better than the one that does not in most cases. 
Also, NUTS/NUTSDA turns out to converge much faster than the standard HMC 
due to the fact that standard HMC has to be tuned costly in pilot runs. For MH 
and standard HMC, the computational cost of tuning is counted when comparing 
the convergence, as is shown in top-center and top-right of figures in lAl and IB] where 
the end of tuning (EOT) is indicated by three vertical dotted lines, corresponding to 
the three variants respectively from left to right. For NUTSDA, the computational 
cost of tuning the parameters of the dual averaging scheme is also counted when 
determining the convergence, as is displayed in bottom-center of figures in |Al and 
m with EOT indicated by three vertical dotted lines, relating to the three variants 
respectively from left to right. Table [S] shows the corresponding computational cost 
of tuning: 



Concrete 

Housing 

Parkinsons 

RBF 

ARD 

RBF 

ARD 

RBF 

ARD 

HMC-I 

6747 

5910 

4779 

3924 

1561 

1340 

HMC-D 

6042 

7316 

7281 

7726 

8883 

8469 

HMC-H 

10851 

9451 

10987 

8860 

10871 

8736 

NUTDA-I 

1402 

3528 

1193 

7433 

1338 

6488 

NUTDA-D 

1357 

1582 

1124 

2424 

975 

1951 

NUTDA-H 

682 

1023 

670 

1866 

728 

1794 


Table 6: Computational cost of tuning for HMC/NUTSDA. Unit: number of 
operations. 


[cl shows the convergence results of PM-AMIS/PM-MH for the RBF (Fig.[9|) and 
ARD (Fig. llOl) case respectively. As can be seen from the figures, both PM-AMIS and 
PM-MH algorithms with higher number of importance samples (Nimp=64) converge 
much faster than those with lower number of importance samples (Nimp=l) in both 
EP and LA approximation cases as expected. The results also indicate that PM- 
AMIS is competitive with PM-MH in terms of convergence speed in most of the 
EP and LA approximation cases. Moreover, PM-AMIS/PM-MH seem to converge 
faster with EP approximation than with LA approximation in most cases which is 
probably because EP yields a more accurate approximation than LA as reported in 
IKuss and Rasmussen! ( 20051 ) , iNickisch and RasmusserJ ( 20081) . 
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A Convergence of samplers for GP regression with the RBF covariance 


Concrete dataset - RBF covariance 
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Fig. 3: Convergence of AMIS/MAMIS, MH, HMC, NUTS, NUTSDA for the Concrete 
dataset. EOT stands for ’’end of tuning”. 


Housing dataset - RBF covariance 
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Fig. 4: Convergence of AMIS/MAMIS, MH, HMC, NUTS, NUTSDA for the Housing 
dataset. EOT stands for ’’end of tuning”. 














































































24 


Xiaoyu Xiong et al. 


Parkinsons dataset - RBF covariance 
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Fig. 5: Convergence of AMIS/MAMIS, MH, HMC, NUTS, NUTSDA for the Parkin¬ 
sons dataset. EOT stands for ’’end of tuning”. 


B Convergence of samplers for GP regression with the ARD covariance 


Concrete dataset - AR.D covariance 
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Fig. 6: Convergence of AMIS/MAMIS, MH, HMC, NUTS, NUTSDA for the Concrete 
dataset. EOT stands for ’’end of tuning”. 
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Housing dataset - ARD covariance 
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Fig. 7: Convergence of AMIS/MAMIS, MH, HMC, NUTS, NUTSDA for the Housing 
dataset. EOT stands for ’’end of tuning”. 


Parkinsons dataset - ARD covariance 






Fig. 8: Convergence of AMIS/MAMIS, MH, HMC, NUTS, NUTSDA for the Parkin¬ 
sons dataset. EOT stands for ’’end of tuning”. 
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C Convergence of samplers for GP classification 
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Fig. 9: Convergence of PM-AMIS, PM-MH for the RBF case. LA indicates the Gaus¬ 
sian approximation to the posterior of latent variables f is obtained by LA approx¬ 
imation, whereas EP indicates the Gaussian approximation is obtained by EP ap¬ 
proximation. Nimp denotes the number of importance samples of latent variables to 
estimate the marginal likelihood p(y | 0 ). 
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Fig. 10: Convergence of PM-AMIS, PM-MH for the ARD case. LA indicates the 
Gaussian approximation to the posterior of latent variables f is obtained by LA 
approximation, whereas EP indicates the Gaussian approximation is obtained by EP 
approximation. Nimp denotes the number of importance samples of latent variables 
to estimate the marginal likelihood p(y | 6 ). 





























